Comments on "Design and performance evaluation
of load distribution strategies for multiple loads on
heterogeneous linear daisy chain networks"

Abstract: Min, Veeravalli, and Barlas proposed [8, 9] strategies to minimize the
execution time of one or several divisible loads on a heterogeneous linear network of
processors, by distributing the work in one or several installments. On a very
simple example, we show that the approach proposed in [9] does not always produce a solution
and that, when it does, the solution is often suboptimal. We also show
how to find an optimal schedule for any instance, once the number of
installments per load is specified. Finally, we formally prove that when the
cost functions are linear, as is the case in [8, 9], an optimal schedule
has an infinite number of installments. Such a cost model therefore cannot be used to
design multi-installment strategies that are usable in practice.

Keywords: scheduling, heterogeneous resources, divisible loads, installments.
1 Introduction

Min, Veeravalli and Barlas proposed [8, 9] strategies to minimize the overall execution time 
of one or several divisible loads on a heterogeneous linear network. Initially, the authors 
targeted single-installment strategies, that is strategies under which a processor receives in 
a single communication all its share of a given load. When they were not able to design 
single-installment strategies, they proposed multi-installment ones.

In this research note, we first show on a very simple example that the approach proposed 
in [9] does not always produce a solution and that, when it does, the solution is often 
suboptimal. The fundamental flaw of the approach of [9] is that the authors are optimizing 
the scheduling load by load, instead of attempting a global optimization. The load by load 
approach is suboptimal and overconstrains the problem. 

In contrast, we show how to find an optimal schedule for any instance, once the
number of installments per load is given. In particular, our approach always finds the optimal
solution in the single-installment case. Finally, we formally prove that under a linear cost
model for communications and computations, as in [8, 9], an optimal schedule has an infinite
number of installments. Such a cost model can therefore not be used to design practical
multi-installment strategies.

Please refer to the papers [8, 9] for a detailed introduction to the optimization problem 
under study. We briefly recall the framework in Section 2, and we deal with an illustrative 
example in Section 3. Then we directly proceed to the design of our solution (Section 4), 
we discuss its possible extensions and the linear cost model (Section 5), before concluding 
(Section 6). 

2 Problem and Notations 

We summarize here the framework of [8, 9]. The target architecture is a linear chain of
$m$ processors ($P_1, P_2, \ldots, P_m$). Processor $P_i$ is connected to processor $P_{i+1}$ by the
communication link $l_i$ (see Figure 1). The target application is composed of $N$ loads, which are
divisible, which means that each load can be split into an arbitrary number of chunks of any
size, and these chunks can be processed independently. All the loads are initially available
on processor $P_1$, which processes a fraction of them and delegates (sends) the remaining
fraction to $P_2$. In turn, $P_2$ executes part of the load that it receives from $P_1$ and sends
the rest to $P_3$, and so on along the processor chain. Communications can be overlapped
with (independent) computations, but a given processor can be active in at most a single
communication at any time-step: sends and receives are serialized (this is the full one-port
model).

Since the last processor $P_m$ cannot start computing before having received its first
message, it is useful for $P_1$ to distribute the loads in several installments: the idle time of remote
processors in the chain is reduced because communications are smaller in
the first steps of the overall execution.

Figure 1: Linear network, with $m$ processors and $m - 1$ links.



We deal with the general case in which the $n$th load is distributed in $Q_n$ installments of
different sizes. For the $j$th installment of load $n$, processor $P_i$ takes a fraction $\gamma_i^j(n)$, and
sends the remaining part to the next processor while processing its own fraction.

In the framework of [8, 9], loads have different characteristics. Every load $n$ (with
$1 \le n \le N$) is defined by a volume of data $V_{\mathrm{comm}}(n)$ and a quantity of computation
$V_{\mathrm{comp}}(n)$. Moreover, processors and links are not identical either. We let $w_i$ be the time
taken by $P_i$ to compute a unit load ($1 \le i \le m$), and $z_i$ be the time taken by $P_i$ to send
a unit load to $P_{i+1}$ (over link $l_i$, $1 \le i \le m - 1$). Note that we assume a linear model for
computations and communications, as in the original articles, and as is often the case in
divisible load literature [7, 4].

For the $j$th installment of the $n$th load, let $\mathrm{Comm}^{\mathrm{start}}_{i,n,j}$ denote the starting time of
the communication between $P_i$ and $P_{i+1}$, and let $\mathrm{Comm}^{\mathrm{end}}_{i,n,j}$ denote its completion time;
similarly, $\mathrm{Comp}^{\mathrm{start}}_{i,n,j}$ denotes the start time of the computation on $P_i$ for this installment, and
$\mathrm{Comp}^{\mathrm{end}}_{i,n,j}$ denotes its completion time. The objective is to minimize the makespan,
i.e., the time at which all loads are computed. For the sake of convenience, all notations are
summarized in Table 1.



3 An illustrative example 
3.1 Presentation 

To show the limitations of [8, 9], we deal with a simple illustrative example. We use 2
identical processors $P_1$ and $P_2$ with $w_1 = w_2 = \lambda$, and $z_1 = 1$. We consider $N = 2$ identical
divisible loads to process, with $V_{\mathrm{comm}}(1) = V_{\mathrm{comm}}(2) = 1$ and $V_{\mathrm{comp}}(1) = V_{\mathrm{comp}}(2) = 1$.
Note that when $\lambda$ is large, communications become negligible and each processor is
expected to process around half of both loads. But when $\lambda$ is close to 0, communications
are very important, and the solution is not obvious. To ease the reading, we only give a
short (intuitive) description of the schedules, and provide their different makespans without
justification (we refer the reader to Appendix A for all proofs).

We first consider a simple schedule which uses a single installment for each load, as
illustrated in Figure 2. Processor $P_1$ computes a fraction $\gamma_1^1(1) = \frac{2\lambda^2+1}{2\lambda^2+2\lambda+1}$ of the first load,
and a fraction $\gamma_1^1(2) = \frac{2\lambda+1}{2\lambda^2+2\lambda+1}$ of the second load. Then the second processor computes a
fraction $\gamma_2^1(1) = \frac{2\lambda}{2\lambda^2+2\lambda+1}$ of the first load, and a fraction $\gamma_2^1(2) = \frac{2\lambda^2}{2\lambda^2+2\lambda+1}$ of the second
load. The makespan achieved by this schedule is equal to $\mathrm{makespan}_1 = \frac{2\lambda(\lambda^2+\lambda+1)}{2\lambda^2+2\lambda+1}$.

  $w_i$                                    Time taken by processor $P_i$ to compute a unit load.
  $\tau_i$                                 Availability date of $P_i$ (time at which it becomes available
                                           for processing the loads).
  $N$                                      Total number of loads.
  $Q_n$                                    Total number of installments for $n$th load.
  $V_{\mathrm{comm}}(n)$                   Volume of data for $n$th load.
  $V_{\mathrm{comp}}(n)$                   Volume of computation for $n$th load.
  $\gamma_i^j(n)$                          Fraction of $n$th load computed on processor $P_i$ during the
                                           $j$th installment.
  $\mathrm{Comm}^{\mathrm{start}}_{i,n,j}$ Start time of communication from processor $P_i$ to processor
                                           $P_{i+1}$ for $j$th installment of $n$th load.
  $\mathrm{Comm}^{\mathrm{end}}_{i,n,j}$   End time of communication from processor $P_i$ to processor
                                           $P_{i+1}$ for $j$th installment of $n$th load.
  $\mathrm{Comp}^{\mathrm{start}}_{i,n,j}$ Start time of computation on processor $P_i$
                                           for $j$th installment of $n$th load.
  $\mathrm{Comp}^{\mathrm{end}}_{i,n,j}$   End time of computation on processor $P_i$
                                           for $j$th installment of $n$th load.

Table 1: Summary of notations.

Figure 2: The example schedule, where $\alpha$ is $\gamma_2^1(1)$ and $\beta$ is $\gamma_2^1(2)$.
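
The single-installment schedule above can be checked with a few lines of exact arithmetic. The sketch below is only an illustration under the assumptions of the example ($w_1 = w_2 = \lambda$, $z_1 = 1$, two unit loads); the function name `schedule_3_1` is hypothetical, and the closed-form shares it uses are the ones consistent with the makespan formulas of Appendix A.

```python
from fractions import Fraction

def schedule_3_1(lam):
    """Shares sent to P2 and finish times of P1 and P2 (example of Section 3.1).

    Assumes w1 = w2 = lam, z1 = 1 and two unit loads.
    """
    lam = Fraction(lam)
    d = 2 * lam**2 + 2 * lam + 1
    alpha = 2 * lam / d        # share of the first load processed by P2
    beta = 2 * lam**2 / d      # share of the second load processed by P2
    # P1 computes its two shares back to back.
    p1_finish = lam * (1 - alpha) + lam * (1 - beta)
    # P2 receives load 1 during [0, alpha], then load 2 during [alpha, alpha + beta].
    p2_first_done = alpha + lam * alpha
    p2_finish = max(p2_first_done, alpha + beta) + lam * beta
    return alpha, beta, p1_finish, p2_finish

print(schedule_3_1(2))
```

Both processors finish at the same time, and P2 is never idle once its first transfer has arrived, since the second transfer ends exactly when P2 completes the first load.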
Figure 3: The schedule of [9] for $\lambda = 2$, with $\alpha = \gamma_2^1(1)$ and $\beta = \gamma_2^1(2)$.



3.2 Solution of [9], one-installment 

In the solution of [9], $P_1$ and $P_2$ have to simultaneously complete the processing of their share
of the first load. The same holds true for the second load. We are in the one-installment
case when $P_1$ is fast enough to send the second load to $P_2$ while it is computing the first
load. This condition reads $\lambda \ge \frac{\sqrt{3}+1}{2} \approx 1.366$.

In the solution of [9], $P_1$ processes a fraction $\gamma_1^1(1) = \frac{\lambda+1}{2\lambda+1}$ of the first load, and a
fraction $\gamma_1^1(2) = \frac{1}{2}$ of the second one. $P_2$ processes a fraction $\gamma_2^1(1) = \frac{\lambda}{2\lambda+1}$ of the first load,
and a fraction $\gamma_2^1(2) = \frac{1}{2}$ of the second one. The makespan achieved by this schedule is
$\mathrm{makespan}_2 = \frac{\lambda(4\lambda+3)}{2(2\lambda+1)}$.

Comparing both makespans, we have $0 \le \mathrm{makespan}_2 - \mathrm{makespan}_1 \le \frac{1}{4}$, the solution
of [9] having a strictly larger makespan, except when $\lambda = \frac{\sqrt{3}+1}{2}$. Intuitively, the solution
of [9] is worse than the schedule of Section 3.1 because it aims at locally optimizing the
makespan for the first load, and then optimizing the makespan for the second one, instead
of directly searching for a global optimum. A visual representation of this case is given in
Figure 3 for $\lambda = 2$.
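
To make the comparison concrete, the two makespans can be evaluated exactly for a few values of $\lambda$. This is a minimal sketch, assuming the closed forms $\frac{2\lambda(\lambda^2+\lambda+1)}{2\lambda^2+2\lambda+1}$ (Section 3.1) and $\frac{\lambda(4\lambda+3)}{2(2\lambda+1)}$ (one-installment solution of [9], valid only for $\lambda \ge \frac{\sqrt{3}+1}{2}$); the function names are hypothetical.

```python
from fractions import Fraction

def makespan_global(lam):
    """Makespan of the single-installment schedule of Section 3.1."""
    lam = Fraction(lam)
    return 2 * lam * (lam**2 + lam + 1) / (2 * lam**2 + 2 * lam + 1)

def makespan_load_by_load(lam):
    """Makespan of the one-installment solution of [9] (needs lam >= (1 + sqrt(3)) / 2)."""
    lam = Fraction(lam)
    return lam * (4 * lam + 3) / (2 * (2 * lam + 1))

def gap(lam):
    """How much longer the load-by-load schedule takes."""
    return makespan_load_by_load(lam) - makespan_global(lam)

for lam in (Fraction(3, 2), 2, 5, 100):
    print(lam, gap(lam), float(gap(lam)))
```

The gap stays below $\frac{1}{4}$ and approaches it as $\lambda$ grows, which matches the bound stated above.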



3.3 Solution of [9], multi-installment 

The solution of [9] is a multi-installment strategy when $\lambda < \frac{\sqrt{3}+1}{2}$, i.e., when communications
tend to be important compared to computations. More precisely, this case happens when
$P_1$ does not have enough time to completely send the second load to $P_2$ before the end of
the computation of the first load on both processors.

The way to proceed in [9] is to send the second load using a multi-installment strategy.
Let $Q$ denote the number of installments for this second load. We can easily compute the size
of each fraction distributed to $P_1$ and $P_2$. Processor $P_1$ has to process a fraction $\gamma_1^1(1) = \frac{\lambda+1}{2\lambda+1}$
of the first load, and fractions $\gamma_1^1(2), \gamma_1^2(2), \ldots, \gamma_1^Q(2)$ of the second one. Processor $P_2$ has
a fraction $\gamma_2^1(1) = \frac{\lambda}{2\lambda+1}$ of the first load, and fractions $\gamma_2^1(2), \gamma_2^2(2), \ldots, \gamma_2^Q(2)$ of the second
one. Moreover, we have the following equality for $1 \le k < Q$:

$\gamma_1^k(2) = \gamma_2^k(2) = \lambda^k \gamma_2^1(1)$.

And for $k = Q$ (the last installment), we have $\gamma_1^Q(2) = \gamma_2^Q(2) \le \lambda^Q \gamma_2^1(1)$. Let $\beta_k = \gamma_1^k(2) =
\gamma_2^k(2)$. We can then establish an upper bound on the portion of the second load distributed
in $Q$ installments:

$\sum_{k=1}^{Q} (2\beta_k) \le 2 \sum_{k=1}^{Q} \left(\gamma_2^1(1)\lambda^k\right) = \frac{2(\lambda^Q - 1)\lambda^2}{2\lambda^2 - \lambda - 1}$

if $\lambda \ne 1$; for $\lambda = 1$, the bound is $\frac{2Q}{3}$, which leads to $Q = 2$.

Figure 4: The example with $\lambda = \frac{1}{2}$, $\alpha = \gamma_2^1(1)$ and $\beta = \gamma_2^1(2)$.

We have three cases to discuss:

1. $0 < \lambda < \frac{\sqrt{17}+1}{8} \approx 0.64$: Since $\lambda < 1$, we can write for any nonnegative integer $Q$:

$\sum_{k=1}^{Q} (2\beta_k) \le \sum_{k=1}^{\infty} \left(2\lambda^k \gamma_2^1(1)\right) = \frac{2\lambda^2}{(1-\lambda)(2\lambda+1)}$.

We have $\frac{2\lambda^2}{(1-\lambda)(2\lambda+1)} < 1$ for all $\lambda < \frac{\sqrt{17}+1}{8}$. So, even in the case of an infinite number
of installments, the second load will not be completely processed. In other words, no
solution is found in [9] for this case. A visual representation of this case is given in
Figure 4 with $\lambda = 0.5$.

2. $\lambda = \frac{\sqrt{17}+1}{8}$: We have $\frac{2\lambda^2}{(1-\lambda)(2\lambda+1)} = 1$, so an infinite number of installments is required
to completely process the second load. Again, this solution is obviously not feasible.

3. $\frac{\sqrt{17}+1}{8} < \lambda < \frac{\sqrt{3}+1}{2}$: In this case, the solution of [9] is better than any solution using
a single installment per load, but it may require a very large number of installments.
A visual representation of this case is given in Figure 5 with $\lambda = 1$.
Figure 5: The example with $\lambda = 1$, $\alpha = \gamma_2^1(1)$ and $\beta = \gamma_2^1(2)$.



In this case, the number of installments is set in [9] as $Q = \left\lceil \ln\left(\frac{4\lambda^2-\lambda-1}{2\lambda^2}\right) / \ln(\lambda) \right\rceil$
if $\lambda \ne 1$ (and $Q = 2$ if $\lambda = 1$). To see that
this choice is not optimal, consider the case $\lambda = \frac{3}{4}$. The algorithm of [9] achieves a
makespan equal to $\left(1 - \gamma_2^1(1)\right)\lambda + \frac{\lambda}{2} = \frac{9}{10}$. The first load is sent in one installment
and the second one is sent in 3 installments (according to the previous equation).

However, we can come up with a better schedule by splitting both loads into two
installments, and distributing them as follows:

• during the first round, $P_2$ processes $\frac{192}{653}$ unit of the first load,

• during the second round, $P_2$ processes $\frac{144}{653}$ unit of the first load,

• during the first round, $P_2$ processes $\frac{108}{653}$ unit of the second load,

• during the second round, $P_2$ processes $\frac{81}{653}$ unit of the second load,

• $P_1$ processes the remainder over its rounds, i.e., $\frac{317}{653}$ unit of the first load and
$\frac{464}{653}$ unit of the second load.

This scheme gives us a total makespan equal to $\frac{3}{4} \cdot \frac{781}{653} = \frac{2343}{2612} \approx 0.897$, which is (slightly)
better than 0.9. This shows that among the schedules having a total number of four
installments, the solution of [9] is suboptimal.
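
The improved four-installment schedule can be recomputed from first principles. The sketch below is an illustration, not the authors' procedure: it assumes that $P_1$ sends `chunks` consecutive pieces to $P_2$, each $\lambda$ times the previous one so that $P_2$ is never idle after its first transfer, and that both processors finish together. The function name `geometric_schedule` is hypothetical, and the function does not check that the pieces assigned to one load fit within that load.

```python
from fractions import Fraction

def geometric_schedule(lam, chunks):
    """Pieces sent to P2 and resulting makespan for two unit loads, w1 = w2 = lam, z1 = 1."""
    lam = Fraction(lam)
    ratio_sum = sum(lam**k for k in range(chunks))   # 1 + lam + ... + lam^(chunks-1)
    # P2 finishes at c1 + lam * c1 * ratio_sum, P1 at lam * (2 - c1 * ratio_sum):
    # equating both finish times gives the first piece c1.
    c1 = 2 * lam / (1 + 2 * lam * ratio_sum)
    sizes = [c1 * lam**k for k in range(chunks)]
    makespan = lam * (2 - sum(sizes))
    return sizes, makespan

sizes, makespan = geometric_schedule(Fraction(3, 4), 4)
print(sizes)      # the first two pieces belong to load 1, the last two to load 2
print(makespan, float(makespan))
```

For $\lambda = \frac{3}{4}$ the four pieces are multiples of $\frac{1}{653}$, and the makespan is below the value $\frac{9}{10}$ obtained by the schedule of [9].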

3.4 Conclusion 

Despite its simplicity (two identical processors and two identical loads), the analysis of this 
illustrative example clearly outlines the limitations of the approach of [9] : this approach does 
not always return a feasible solution and, when it does, this solution is not always optimal. 




In the next section, we show how to compute an optimal schedule when dividing each load 
into any prescribed number of installments. 

4 Optimal solution 

We now show how to compute an optimal schedule, when dividing each load into any
prescribed number of installments. Therefore, when this number of installments is set to 1 for
each load (i.e., $Q_n = 1$, for any $n$ in $[1, N]$), the following approach solves the problem
originally targeted by Min, Veeravalli, and Barlas.

To build our solution we use a linear programming approach. In fact, we only have to list 
all the (linear) constraints that must be fulfilled by a schedule, and write that we want to 
minimize the makespan. All these constraints are captured by the linear program in Figure 6. 
The optimality of the solution comes from the fact that the constraints are exactly all the 
constraints a schedule must fulfill, and a solution to the linear program is obviously always 
feasible. This linear program simply encodes the following constraints (where a number in 
brackets is the number of the corresponding constraint on Figure 6): 

• $P_i$ cannot start a new communication to $P_{i+1}$ before the end of the corresponding
communication from $P_{i-1}$ to $P_i$ (1),

• $P_i$ cannot start to receive the next installment of the $n$th load before having finished
sending the current one to $P_{i+1}$ (2),

• $P_i$ cannot start to receive the first installment of the next load before having finished
sending the last installment of the current load to $P_{i+1}$ (3),

• any transfer has to begin at a nonnegative time (4),

• the duration of any transfer is equal to the product of the time taken to transmit a
unit load by the volume of data to transfer (5),

• processor $P_i$ cannot start to compute the $j$th installment of the $n$th load before having
finished receiving the corresponding data (6),

• the duration of any computation is equal to the product of the time taken to compute
a unit load by the volume of computations (7),

• processor $P_i$ cannot start to compute the first installment of the next load before it
has completed the computation of the last installment of the current load (8),

• processor $P_i$ cannot start to compute the next installment of a load before it has
completed the computation of the current installment of that load (9),

• processor $P_i$ cannot start to compute the first installment of the first load before its
availability date (10),
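
Minimize makespan, subject to:

$\forall i < m-1,\ n \le N,\ j \le Q_n$:   $\mathrm{Comm}^{\mathrm{start}}_{i+1,n,j} \ge \mathrm{Comm}^{\mathrm{end}}_{i,n,j}$   (1)
$\forall i < m-1,\ n \le N,\ j < Q_n$:   $\mathrm{Comm}^{\mathrm{start}}_{i,n,j+1} \ge \mathrm{Comm}^{\mathrm{end}}_{i+1,n,j}$   (2)
$\forall i < m-1,\ n < N$:   $\mathrm{Comm}^{\mathrm{start}}_{i,n+1,1} \ge \mathrm{Comm}^{\mathrm{end}}_{i+1,n,Q_n}$   (3)
$\forall i \le m-1,\ n \le N,\ j \le Q_n$:   $\mathrm{Comm}^{\mathrm{start}}_{i,n,j} \ge 0$   (4)
$\forall i \le m-1,\ n \le N,\ j \le Q_n$:   $\mathrm{Comm}^{\mathrm{end}}_{i,n,j} = \mathrm{Comm}^{\mathrm{start}}_{i,n,j} + z_i V_{\mathrm{comm}}(n) \sum_{k=i+1}^{m} \gamma_k^j(n)$   (5)
$\forall i \ge 2,\ n \le N,\ j \le Q_n$:   $\mathrm{Comp}^{\mathrm{start}}_{i,n,j} \ge \mathrm{Comm}^{\mathrm{end}}_{i-1,n,j}$   (6)
$\forall i \le m,\ n \le N,\ j \le Q_n$:   $\mathrm{Comp}^{\mathrm{end}}_{i,n,j} = \mathrm{Comp}^{\mathrm{start}}_{i,n,j} + w_i \gamma_i^j(n) V_{\mathrm{comp}}(n)$   (7)
$\forall i \le m,\ n < N$:   $\mathrm{Comp}^{\mathrm{start}}_{i,n+1,1} \ge \mathrm{Comp}^{\mathrm{end}}_{i,n,Q_n}$   (8)
$\forall i \le m,\ n \le N,\ j < Q_n$:   $\mathrm{Comp}^{\mathrm{start}}_{i,n,j+1} \ge \mathrm{Comp}^{\mathrm{end}}_{i,n,j}$   (9)
$\forall i \le m$:   $\mathrm{Comp}^{\mathrm{start}}_{i,1,1} \ge \tau_i$   (10)
$\forall i \le m,\ n \le N,\ j \le Q_n$:   $\gamma_i^j(n) \ge 0$   (11)
$\forall n \le N$:   $\sum_{i=1}^{m} \sum_{j=1}^{Q_n} \gamma_i^j(n) = 1$   (12)
$\forall i \le m$:   $\mathrm{makespan} \ge \mathrm{Comp}^{\mathrm{end}}_{i,N,Q_N}$   (13)

Figure 6: The complete linear program.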

• every portion of a load dedicated to a processor is necessarily nonnegative (11), 

• any load has to be completely processed (12), 

• the makespan is no smaller than the completion time of the last installment of the last 
load on any processor (13). 

Altogether, we have a linear program to be solved over the rationals, hence a solution in
polynomial time [6]. In practice, standard packages like Maple [3] or GLPK [5] will return
the optimal solution for all reasonable problem sizes.

Note that the linear program gives the optimal solution for a prescribed number of 
installments for each load. We will discuss the problem of the number of installments in the 
next section. 
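
To illustrate how the timing constraints interact, the following sketch evaluates a schedule whose shares are already fixed: once the fractions are known, the earliest start and end times allowed by constraints (1) to (10) follow from a simple forward pass, and constraint (13) gives the makespan. This is not the linear program itself (no optimization is performed); the function name and argument layout are hypothetical, indices start at 0, and the sketch additionally assumes that transfers on a given link are serialized, as the one-port model of Section 2 requires.

```python
from fractions import Fraction

def earliest_makespan(w, z, tau, loads, gamma):
    """Earliest makespan of a schedule whose shares gamma[n][j][i] are fixed.

    w[i], tau[i]: unit computation time and availability date of processor i.
    z[i]: unit transfer time on the link from processor i to processor i + 1.
    loads[n] = (v_comm, v_comp); gamma[n][j][i] = share of load n computed by
    processor i during installment j.
    """
    m = len(w)
    for installments in gamma:
        # Constraint (12): each load is processed entirely.
        assert sum(sum(shares) for shares in installments) == 1
    comm_end_prev = [0] * (m - 1)   # end of the previous transfer on each link
    comp_end = list(tau)            # (10): nothing is computed before tau[i]
    for (v_comm, v_comp), installments in zip(loads, gamma):
        for shares in installments:
            comm_end = [0] * (m - 1)
            for i in range(m - 1):
                start = comm_end_prev[i]          # one transfer at a time on link i
                if i >= 1:
                    start = max(start, comm_end[i - 1])       # (1)
                if i + 1 <= m - 2:
                    start = max(start, comm_end_prev[i + 1])  # (2), (3)
                # (5): everything meant for processors i+1..m-1 crosses link i.
                comm_end[i] = start + z[i] * v_comm * sum(shares[i + 1:])
            for i in range(m):
                start = comp_end[i]                           # (8), (9)
                if i >= 1:
                    start = max(start, comm_end[i - 1])       # (6)
                comp_end[i] = start + w[i] * shares[i] * v_comp  # (7)
            comm_end_prev = comm_end
    return max(comp_end)                                      # (13)

lam = Fraction(2)
two_loads = [(1, 1), (1, 1)]
global_shares = [[(Fraction(9, 13), Fraction(4, 13))], [(Fraction(5, 13), Fraction(8, 13))]]
print(earliest_makespan([lam, lam], [1], [0, 0], two_loads, global_shares))  # 28/13
```

Exact `Fraction` shares are used so that the check of constraint (12) is not disturbed by rounding; with floating-point shares that assertion would need a tolerance.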

5 Possible extensions 

There are several restrictions in the model of [9] that can be alleviated. First, the model uses
uniform machines, meaning that the speed of a processor does not depend on the task that it
executes. It is easy to extend the linear program to unrelated parallel machines, by introducing
a separate parameter for the time taken by $P_i$ to process a unit load of type $n$. Also, all processors
and loads are assumed to be available from the beginning. In our linear program, we have
introduced availability dates for processors. In the same way, we could have introduced release
dates for loads. Furthermore, instead of minimizing the makespan, we could have targeted
any other objective function which is an affine combination of the loads' completion times and
of the problem characteristics, like the average completion time, the maximum or average
(weighted) flow, etc.

The formulation of the problem does not allow any piece of the $n'$th load to be processed
before the $n$th load is completely processed, if $n' > n$. We can easily extend our solution
to allow for $N$ rounds of the $N$ loads, each load being still divided into several installments.
This would allow interleaving the processing of the different loads.

The divisible load model is linear, which causes major problems for multi-installment
approaches. Indeed, once we have a way to find an optimal solution when the number
of installments per load is given, the question is: what is the optimal number of
installments? Under a linear model for communications and computations, the optimal number
of installments is infinite, as the following theorem states:

Theorem 1. Let us consider, under a linear cost model for communications and
computations, an instance of our problem with one or more loads and at least two processors. Then,
any schedule using a finite number of installments is suboptimal for makespan minimization.

This theorem is proved by building, from any schedule, another schedule with a strictly 
smaller makespan. The proof is available in Appendix B. 
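
A small numerical illustration of the theorem, under assumptions that are not part of the proof: a single unit load, two identical processors with $w_1 = w_2 = \lambda$ and $z_1 = 1$, and a particular family of schedules in which each piece sent to $P_2$ is $\lambda$ times the previous one (so $P_2$ never waits after its first transfer) and both processors finish together. The function name is hypothetical, and this family is not claimed to be optimal for a given number of installments.

```python
from fractions import Fraction

def makespan_with_installments(lam, q):
    """Makespan of the geometric q-installment schedule for one unit load."""
    lam = Fraction(lam)
    s = sum(lam**k for k in range(q))   # 1 + lam + ... + lam^(q-1)
    # P2 finishes at c1 + lam * c1 * s, P1 at lam * (1 - c1 * s); equate both.
    c1 = lam / (1 + 2 * lam * s)
    return lam * (1 - c1 * s)

for q in range(1, 7):
    value = makespan_with_installments(Fraction(1, 2), q)
    print(q, value, float(value))
```

Within this family, every additional installment strictly reduces the makespan, so no finite number of installments can be the best choice under the linear cost model.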

An infinite number of installments obviously does not define a feasible solution. Moreover,
in practice, when the number of installments becomes too large, the model is inaccurate, as
acknowledged in [2, p. 224 and 276]. Any communication incurs a startup cost $K$, which we
express in bytes. Consider the $n$th load, whose communication volume is $V_{\mathrm{comm}}(n)$: it is split
into $Q_n$ installments, and each installment requires $m - 1$ communications. The ratio $\rho \ge 1$
between the actual and estimated communication costs therefore grows with the total
startup volume $Q_n (m-1) K$ relative to $V_{\mathrm{comm}}(n)$.
Since $K$, $m$, and $V_{\mathrm{comm}}(n)$ are known values, we can choose $Q_n$ such that $\rho$ is kept relatively
small, and so such that the model remains valid for the target application. Another, and more
accurate solution, would be to introduce latencies in the model, as in [1]. This latter article
shows how to design asymptotically optimal multi-installment strategies for star networks.
A similar approach should be used for linear networks.
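
As a rough illustration of how $Q_n$ could be bounded in practice, the sketch below assumes a simple overhead model that is only an assumption here: every communication adds $K$ bytes of startup to the estimated volume, so that $\rho = 1 + \frac{Q_n (m-1) K}{V_{\mathrm{comm}}(n)}$. Both function names are hypothetical; sizes are integers in bytes and `rho_max` is given as a `Fraction` (or a decimal string) to keep the arithmetic exact.

```python
from fractions import Fraction
import math

def overhead_ratio(q, m, startup, v_comm):
    """Assumed ratio between actual and estimated communication volume."""
    return 1 + Fraction(q * (m - 1) * startup, v_comm)

def max_installments(m, startup, v_comm, rho_max):
    """Largest Q keeping the assumed ratio at most rho_max (0 if even Q = 1 exceeds it)."""
    bound = (Fraction(rho_max) - 1) * v_comm / ((m - 1) * startup)
    return math.floor(bound)

# Example: 5 processors, 1000-byte startup, 100 MB load, at most 1% overhead.
q = max_installments(5, 1000, 10**8, Fraction(101, 100))
print(q, overhead_ratio(q, 5, 1000, 10**8))
```

A model with explicit latencies, as suggested in the text, would replace this back-of-the-envelope bound.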

6 Conclusion 

We have shown that a linear programming approach makes it possible to solve all instances of the
scheduling problem addressed in [8, 9]. In contrast, the original approach provided a
solution only for particular problem instances. Moreover, the linear programming approach
returns an optimal solution for any number of installments, while the original approach was
empirically limited to very special strategies, and was often suboptimal.

Intuitively, the solution of [9] is worse than the schedule of Section 3.1 because it aims
at locally optimizing the makespan for the first load, and then optimizing the makespan for
the second one, and so on, instead of directly searching for a global optimum. We did not
find beautiful closed-form expressions defining optimal solutions but, through the power of
linear programming, we were able to find an optimal schedule for any instance.



A Analytical computations for the illustrative example 

In this appendix, we prove the results stated in Sections 3.2 and 3.3. In order to simplify
equations, we write $\alpha$ instead of $\gamma_2^1(1)$ (i.e., $\alpha$ is the fraction of the first load sent from the
first processor to the second one), and $\beta$ instead of $\gamma_2^1(2)$ (similarly, $\beta$ is the fraction of the
second load sent to the second processor).

In this research note we used simpler notations than the ones used in [9]. However, as we
want to make explicit the solutions proposed by [9] for our example, we need to use the original
notations to enable the reader to double-check our statements. The necessary notations
from [9] are recalled in Table 2.



  $T^n_{cp}$             Time taken by the standard processor ($w = 1$) to compute the load $L_n$.
  $T^n_{cm}$             Time taken by the standard link ($z = 1$) to communicate the load $L_n$.
  $L_n$                  Size of the $n$th load, where $1 \le n \le N$.
  $L_{k,n}$              Portion of the load $L_n$ assigned to the $k$th installment for processing.
  $\alpha^{(k)}_{n,i}$   The fraction of the total load $L_{k,n}$ to $P_i$, where
                         $0 \le \alpha^{(k)}_{n,i} \le 1$, $\forall i = 1, \ldots, m$ and $\sum_{i=1}^{m} \alpha^{(k)}_{n,i} = 1$.
  $t_{k,n}$              The time instant at which is initiated the first communication for the
                         $k$th installment of load $L_n$ ($L_{k,n}$).
  $C_{k,n}$              The total communication time of the $k$th installment of load $L_n$ when
                         $L_{k,n} = 1$; $C_{k,n} = \frac{T^n_{cm}}{L_n} \sum_{p=1}^{m-1} z_p \left(1 - \sum_{j=1}^{p} \alpha^{(k)}_{n,j}\right)$.
  $E_{k,n}$              The total processing time of $P_m$ for the $k$th installment of load $L_n$ when
                         $L_{k,n} = 1$; $E_{k,n} = \alpha^{(k)}_{n,m} w_m T^n_{cp} \frac{1}{L_n}$.
  $T(k,n)$               The finish time of the $k$th installment of load $L_n$; it is defined as the
                         time instant at which the processing of the $k$th installment of load $L_n$ ends.
  $T(n)$                 The finish time of the load $L_n$; it is defined as the time instant
                         at which the processing of the $n$th load ends, i.e., $T(n) = T(Q_n, n)$
                         where $Q_n$ is the total number of installments required to finish
                         processing load $L_n$. $T(N)$ is the finish time of the entire set of loads
                         resident in $P_1$.

Table 2: Summary of the notations of [9] used in this paper.



In the solution of [9], both $P_1$ and $P_2$ have to finish the first load at the same time,
and the same holds true for the second load. The transmission for the first load will take
$\alpha$ time units, and the one for the second load $\beta$ time units. Since $P_1$ (respectively $P_2$) will
process the first load during $\lambda(1-\alpha)$ (respectively $\lambda\alpha$) time units and the second load during
$\lambda(1-\beta)$ (respectively $\lambda\beta$) time units, we can write the following equations:

$\lambda(1-\alpha) = \alpha + \lambda\alpha$   (14)

$\lambda(1-\alpha) + \lambda(1-\beta) = \left(\alpha + \max(\beta, \lambda\alpha)\right) + \lambda\beta$

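There are two cases to discuss:

1. $\max(\beta, \lambda\alpha) = \lambda\alpha$. We are in the one-installment case when $L_2 C_{1,2} \le T(1) - t_{1,2}$, i.e.,
$\beta \le \lambda(1-\alpha) - \alpha$ (equation (5) in [9], where $L_2 = 1$, $C_{1,2} = \beta$, $T(1) = \lambda(1-\alpha)$ and
$t_{1,2} = \alpha$). The values of $\alpha$ and $\beta$ are given by:

$\alpha = \frac{\lambda}{2\lambda+1}$ and $\beta = \frac{1}{2}$.

This case is true for $\lambda\alpha \ge \beta$, i.e., $\frac{\lambda^2}{2\lambda+1} \ge \frac{1}{2} \Leftrightarrow \lambda \ge \frac{1+\sqrt{3}}{2} \approx 1.366$.
In this case, the makespan is equal to:

$\mathrm{makespan}_2 = \lambda(1-\alpha) + \lambda(1-\beta) = \frac{\lambda(4\lambda+3)}{2(2\lambda+1)}$.

Comparing both makespans, we have:

$\mathrm{makespan}_2 - \mathrm{makespan}_1 = \frac{\lambda(2\lambda^2-2\lambda-1)}{8\lambda^3+12\lambda^2+8\lambda+2}$.

For all $\lambda \ge \frac{\sqrt{3}+1}{2} \approx 1.366$, our solution is better than theirs, since:

$\frac{1}{4} \ge \mathrm{makespan}_2 - \mathrm{makespan}_1 \ge 0$.

Furthermore, the solution of [9] is strictly suboptimal for any $\lambda > \frac{\sqrt{3}+1}{2}$.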

2. $\max(\beta, \lambda\alpha) = \beta$. In this case, $P_1$ does not have enough time to completely send the
second load to $P_2$ before the end of the computation of the first load on both processors.
The way to proceed in [9] is to send the second load using a multi-installment strategy.

By using (14), we can compute the value of $\alpha$:

$\alpha = \frac{\lambda}{2\lambda+1}$.

Then we have $T(1) = (1-\alpha)\lambda = \frac{\lambda(\lambda+1)}{2\lambda+1}$ and $t_{1,2} = \alpha = \frac{\lambda}{2\lambda+1}$, i.e., the communication
for the second request begins as soon as possible.

We know from equation (1) of [9] that $\alpha^{(k)}_{2,1} = \alpha^{(k)}_{2,2}$, and by definition of the $\alpha$'s,
$\alpha^{(k)}_{2,1} + \alpha^{(k)}_{2,2} = 1$, so we have $\alpha^{(k)}_{2,i} = \frac{1}{2}$. We also have $C_{1,2} = 1 - \alpha^{(k)}_{2,1} = \frac{1}{2}$ and $E_{1,2} = \frac{\lambda}{2}$.

We will denote by $\beta_1, \ldots, \beta_Q$ the sizes of the different installments processed on each
processor (then we have $L_{k,2} = 2\beta_k$).

Since the second processor is not left idle, and since the size of the first installment
is such that the communication ends when $P_2$ completes the computation of the first
load, we have $\beta_1 = T(1) - t_{1,2} = \lambda\alpha$ (see equation (27) in [9], in which we have
$C_{1,2} = \frac{1}{2}$).

In the same way, we have $\beta_2 = \lambda\beta_1$, $\beta_3 = \lambda\beta_2$, and so on (see equation (38) in [9], with
$C_{1,2} = \frac{1}{2}$):

$\beta_k = \lambda^k \alpha$.

Each processor computes the same fraction of the second load. If we have $Q$
installments, the total processed portion of the second load is upper bounded as follows:

$\sum_{k=1}^{Q} 2\beta_k \le \sum_{k=1}^{Q} 2\lambda^k\alpha = \frac{2\lambda^2(\lambda^Q - 1)}{(\lambda-1)(2\lambda+1)}$

if $\lambda \ne 1$, and $\sum_{k=1}^{Q} 2\beta_k \le \frac{2Q}{3}$ otherwise.



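We have four sub-cases to discuss:

(a) $0 < \lambda < \frac{\sqrt{17}+1}{8} \approx 0.64$: Since $\lambda < 1$, we can write for any nonnegative integer $Q$:

$\sum_{k=1}^{Q} 2\lambda^k\alpha < \sum_{k=1}^{\infty} 2\lambda^k\alpha = \frac{2\lambda^2}{(1-\lambda)(2\lambda+1)}$.

We have $\frac{2\lambda^2}{(1-\lambda)(2\lambda+1)} < 1$ for all $\lambda < \frac{\sqrt{17}+1}{8}$. So, even in the case of an infinite
number of installments, the second load will not be completely processed. In
other words, no solution is found in [9] for this case.

(b) $\lambda = \frac{\sqrt{17}+1}{8}$: We have $\frac{2\lambda^2}{(1-\lambda)(2\lambda+1)} = 1$, so an infinite number of installments is
required to completely process the second load. Again, this solution is obviously
not feasible.

(c) $\frac{\sqrt{17}+1}{8} < \lambda < \frac{\sqrt{3}+1}{2}$ and $\lambda \ne 1$: In this case, the solution of [9] is better than
any solution using a single installment per load, but it may require a very large
number of installments.

Now, let us compute the number of installments. We know that the $i$th installment
is equal to $\beta_i = \lambda^i \gamma_2^1(1)$, except for the last one, which can be smaller than
$\lambda^Q \gamma_2^1(1)$. So, instead of writing $\sum_{i=1}^{Q} 2\beta_i = \left(\sum_{i=1}^{Q-1} 2\lambda^i\gamma_2^1(1)\right) + 2\beta_Q = 1$, we
write:

$\sum_{i=1}^{Q} 2\lambda^i\gamma_2^1(1) \ge 1 \Leftrightarrow \frac{2\lambda^2(\lambda^Q - 1)}{(\lambda-1)(2\lambda+1)} \ge 1 \Leftrightarrow \frac{2\lambda^{Q+2}}{(\lambda-1)(2\lambda+1)} \ge \frac{2\lambda^2}{(\lambda-1)(2\lambda+1)} + 1$.

If $\lambda$ is strictly smaller than 1, we obtain:

$\frac{2\lambda^{Q+2}}{(\lambda-1)(2\lambda+1)} \ge \frac{2\lambda^2}{(\lambda-1)(2\lambda+1)} + 1 \Leftrightarrow \lambda^Q \le \frac{4\lambda^2-\lambda-1}{2\lambda^2}$

$\Leftrightarrow \ln(\lambda^Q) \le \ln\left(\frac{4\lambda^2-\lambda-1}{2\lambda^2}\right)$

$\Leftrightarrow Q \ln(\lambda) \le \ln\left(\frac{4\lambda^2-\lambda-1}{2\lambda^2}\right)$

$\Leftrightarrow Q \ge \frac{\ln\left(\frac{4\lambda^2-\lambda-1}{2\lambda^2}\right)}{\ln(\lambda)}$.

We thus obtain:

$Q = \left\lceil \frac{\ln\left(\frac{4\lambda^2-\lambda-1}{2\lambda^2}\right)}{\ln(\lambda)} \right\rceil$.

When $\lambda$ is strictly greater than 1 we obtain the exact same result (then $\lambda - 1$ and
$\ln(\lambda)$ are both positive).

(d) $\lambda = 1$. In this case,

$\sum_{i=1}^{Q} 2\lambda^i\gamma_2^1(1) = \frac{2Q}{3} \ge 1$

simply leads to $Q = 2$.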
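
The four sub-cases can be cross-checked by accumulating the installments directly instead of using logarithms. This is a sketch under the assumptions of the example ($\alpha = \frac{\lambda}{2\lambda+1}$, installments $\beta_k = \lambda^k\alpha$ on each processor); the function name is hypothetical and $\lambda$ is taken as a rational number, so the boundary value of sub-case (b), which is irrational, is never hit exactly.

```python
from fractions import Fraction
import math

def installments_needed(lam):
    """Smallest Q with sum_{k=1..Q} 2 * lam**k * alpha >= 1, or None if no finite Q exists."""
    lam = Fraction(lam)
    if 4 * lam**2 - lam - 1 <= 0:
        return None            # sub-cases (a) and (b): the whole series stays at or below 1
    alpha = lam / (2 * lam + 1)
    covered, q = Fraction(0), 0
    while covered < 1:
        q += 1
        covered += 2 * lam**q * alpha
    return q

for lam in (Fraction(1, 2), Fraction(3, 4), Fraction(1), Fraction(6, 5), Fraction(2)):
    print(lam, installments_needed(lam))

# Closed form of sub-case (c) for lam = 3/4, evaluated in floating point.
print(math.ceil(math.log((4 * 0.75**2 - 0.75 - 1) / (2 * 0.75**2)) / math.log(0.75)))
```

A result of 1 corresponds to the one-installment regime $\lambda\alpha \ge \frac{1}{2}$ of case 1, which is why the same loop also covers $\lambda \ge \frac{\sqrt{3}+1}{2}$.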



B Proof of Theorem 1 

Proof. We first remark that in any optimal solution to our problem all processors work and
complete their share simultaneously. To prove this statement, we consider a schedule where
one processor completes its share strictly before the makespan (this processor may not be
doing any work at all). Then, under this schedule there exist two neighboring processors, $P_i$
and $P_{i+1}$, such that one finishes at the makespan, denoted $\mathcal{M}$, and one strictly earlier. We
have two cases to consider:

1. There exists a processor $P_i$ which finishes strictly before the makespan $\mathcal{M}$ and such
that the processor $P_{i+1}$ completes its share exactly at time $\mathcal{M}$. $P_{i+1}$ receives all
the data it processes from $P_i$. We consider any installment $j$ of any load $L_n$ that
is effectively processed by $P_{i+1}$ (that is, $P_{i+1}$ processes a non-null portion of the
$j$th installment of load $L_n$). We modify the schedule as follows: $P_i$ enlarges by an
amount $\epsilon$, and $P_{i+1}$ decreases by an amount $\epsilon$, the portion of the $j$th installment of
the load $L_n$ it processes. Then, the completion time of $P_i$ is increased, and that of
$P_{i+1}$ is decreased, by an amount proportional to $\epsilon$ as our cost model is linear. If $\epsilon$
is small enough, both processors complete their work strictly before $\mathcal{M}$. With our
modification of the schedule, the size of a single communication was modified, and this
size was decreased. Therefore, this modification did not enlarge the completion time of
any processor except $P_i$. Therefore, the number of processors whose completion time
is equal to $\mathcal{M}$ is decreased by at least one by our schedule modification.

2. No processor which completes its share strictly before time $\mathcal{M}$ is followed by a processor
finishing at time $\mathcal{M}$. Therefore, there exists an index $i$ such that the processors $P_1$
through $P_i$ all complete their share exactly at $\mathcal{M}$, and the processors $P_{i+1}$ through $P_m$
complete their share strictly earlier. Then, let the last data to be effectively processed
by $P_i$ be a portion of the $j$th installment of the load $L_n$. Then $P_i$ decreases by a size
$\epsilon$, and $P_{i+1}$ increases by a size $\epsilon$, the portion of the $j$th installment of load $L_n$ that
it processes. Then the completion time of $P_i$ is decreased by an amount proportional
to $\epsilon$ and the completion time of the processors $P_{i+1}$ through $P_m$ is increased by an
amount proportional to $\epsilon$. Therefore, if $\epsilon$ is small enough, the processors $P_i$ through
$P_m$ complete their work strictly before $\mathcal{M}$.

In both cases, after we modified the schedule, there is at least one more processor which
completes its work strictly before time $\mathcal{M}$, and no processor is completing its share after
that time. If no processor is any longer completing its share at time $\mathcal{M}$, we have obtained
a schedule with a better makespan. Otherwise, we just iterate our process. As the
number of processors is finite, we will eventually end up with a schedule whose makespan is
strictly smaller than $\mathcal{M}$. Hence, in an optimal schedule all processors complete their work
simultaneously (and thus all processors work).

We now prove the theorem itself by contradiction. Let $\mathcal{S}$ be any optimal schedule using
a finite number of installments. As processors $P_2$ through $P_m$ initially hold no data, they
stay temporarily idle during the schedule execution, waiting to receive some data to be able
to process them. Let us consider processor $P_2$. As the idleness of $P_2$ is only temporary
(all processors are working in an optimal solution), this processor is only idle because it is
lacking data to process and it is waiting for some. Therefore, the last moment at which $P_2$
stays temporarily idle under $\mathcal{S}$ is the moment it finishes receiving some data, namely the
$j$th installment of load $L_n$ sent to it by processor $P_1$.

As previously, $Q_k$ is the number of installments of the load $L_k$ under $\mathcal{S}$. Then from the
schedule $\mathcal{S}$ we build a schedule $\mathcal{S}'$ by dividing into two identical halves the $j$th installment of
load $L_n$. Formally:

• All loads except $L_n$ have the exact same installments under $\mathcal{S}'$ as under $\mathcal{S}$.

• The load $L_n$ has $(1 + Q_n)$ installments under $\mathcal{S}'$, defined as follows.

• The first $(j - 1)$ installments of $L_n$ under $\mathcal{S}'$ are identical to the first $(j - 1)$ installments
of this load under $\mathcal{S}$.

• The $j$th and $(j + 1)$th installments of $L_n$ under $\mathcal{S}'$ are identical to the $j$th installment
of $L_n$ under $\mathcal{S}$, except that all sizes are halved.

• The last $(Q_n - j)$ installments of $L_n$ under $\mathcal{S}'$ are identical to the last $(Q_n - j)$
installments of this load under $\mathcal{S}$.

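We must first remark that no completion time is increased by the transformation from
$\mathcal{S}$ to $\mathcal{S}'$. Therefore the makespan of $\mathcal{S}'$ is no greater than the makespan of $\mathcal{S}$. We denote by
$\mathrm{Comm}^{\mathrm{start}}_{1,n,j}$ (respectively $\mathrm{Comm}^{\mathrm{end}}_{1,n,j}$) the time at which processor $P_1$ starts (resp. finishes)
sending to processor $P_2$ the $j$th installment of load $L_n$ under $\mathcal{S}$. We denote by $\mathrm{Comp}^{\mathrm{start}}_{2,n,j}$
(respectively $\mathrm{Comp}^{\mathrm{end}}_{2,n,j}$) the time at which processor $P_2$ starts (resp. finishes) computing
the $j$th installment of load $L_n$ under $\mathcal{S}$. We use similar notations, with an added prime, for
schedule $\mathcal{S}'$. One can then easily derive the following properties:

$\mathrm{Comm}'^{\mathrm{start}}_{1,n,j} = \mathrm{Comm}^{\mathrm{start}}_{1,n,j}$.   (15)

$\mathrm{Comm}'^{\mathrm{start}}_{1,n,j+1} = \mathrm{Comm}'^{\mathrm{end}}_{1,n,j} = \frac{\mathrm{Comm}^{\mathrm{start}}_{1,n,j} + \mathrm{Comm}^{\mathrm{end}}_{1,n,j}}{2}$.   (16)

$\mathrm{Comm}'^{\mathrm{end}}_{1,n,j+1} = \mathrm{Comm}^{\mathrm{end}}_{1,n,j}$.   (17)

$\mathrm{Comp}'^{\mathrm{start}}_{2,n,j} = \mathrm{Comm}'^{\mathrm{end}}_{1,n,j}$.   (18)

$\mathrm{Comp}'^{\mathrm{end}}_{2,n,j} = \mathrm{Comm}'^{\mathrm{end}}_{1,n,j} + \frac{\mathrm{Comp}^{\mathrm{end}}_{2,n,j} - \mathrm{Comp}^{\mathrm{start}}_{2,n,j}}{2}$.   (19)

$\mathrm{Comp}'^{\mathrm{start}}_{2,n,j+1} = \max\left\{\mathrm{Comp}'^{\mathrm{end}}_{2,n,j},\ \mathrm{Comm}'^{\mathrm{end}}_{1,n,j+1}\right\}$.   (20)

$\mathrm{Comp}'^{\mathrm{end}}_{2,n,j+1} = \mathrm{Comp}'^{\mathrm{start}}_{2,n,j+1} + \frac{\mathrm{Comp}^{\mathrm{end}}_{2,n,j} - \mathrm{Comp}^{\mathrm{start}}_{2,n,j}}{2}$.   (21)

Using equations (16), (17), (19), (20), and (21) we then establish that:

$\mathrm{Comp}'^{\mathrm{end}}_{2,n,j+1} = \max\left\{\frac{\mathrm{Comm}^{\mathrm{start}}_{1,n,j} + \mathrm{Comm}^{\mathrm{end}}_{1,n,j}}{2} + \mathrm{Comp}^{\mathrm{end}}_{2,n,j} - \mathrm{Comp}^{\mathrm{start}}_{2,n,j},\ \mathrm{Comm}^{\mathrm{end}}_{1,n,j} + \frac{\mathrm{Comp}^{\mathrm{end}}_{2,n,j} - \mathrm{Comp}^{\mathrm{start}}_{2,n,j}}{2}\right\}$.
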

Therefore, under schedule $\mathcal{S}'$ processor $P_2$ completes strictly earlier than under $\mathcal{S}$ the
computation of what was the $j$th installment of load $L_n$ under $\mathcal{S}$. If $P_2$ is no longer idle after
the time $\mathrm{Comp}'^{\mathrm{end}}_{2,n,j+1}$, then it completes its overall work strictly earlier under $\mathcal{S}'$ than under
$\mathcal{S}$. On the other hand, $P_1$ completes its work at the same time. Then, using the fact that in
an optimal solution all processors finish simultaneously, we conclude that $\mathcal{S}'$ is not optimal.
As we have already remarked that its makespan is no greater than the makespan of $\mathcal{S}$, we
end up with the contradiction that $\mathcal{S}$ is not optimal. Therefore, $P_2$ must be idle at some
time after the time $\mathrm{Comp}'^{\mathrm{end}}_{2,n,j+1}$. Then we apply to $\mathcal{S}'$ the transformation we applied to $\mathcal{S}$
as many times as needed to obtain a contradiction. This process is bounded as the number
of communications that processor $P_2$ receives after the time it is idle for the last time is
strictly decreasing when we transform the schedule $\mathcal{S}$ into the schedule $\mathcal{S}'$. □
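
The effect of halving an installment, as captured by equations (15) to (21), can be checked numerically. The sketch below is only an illustration: the function name and its three arguments (start and end of the original transfer to $P_2$, and the time $P_2$ needs to compute the whole installment) are hypothetical, and it assumes, as in the proof, that $P_2$ starts computing as soon as the data has arrived.

```python
def halved_completion(comm_start, comm_end, comp_time):
    """Time at which P2 finishes both halves of the split installment."""
    mid = (comm_start + comm_end) / 2          # (16): the first half has arrived
    first_done = mid + comp_time / 2           # (18), (19): first half computed
    second_start = max(first_done, comm_end)   # (17), (20): wait for the second half if needed
    return second_start + comp_time / 2        # (21)

# Without the split, P2 would finish at comm_end + comp_time.
for comm_start, comm_end, comp_time in ((0, 4, 6), (0, 4, 2), (1, 3, 2)):
    before = comm_end + comp_time
    after = halved_completion(comm_start, comm_end, comp_time)
    print(before, after)
```

In every case with a transfer of positive duration and a positive computation time, the split installment completes strictly earlier, which is the step the proof relies on.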